Refining epidemiological forecasts with simple scoring rules

Estimates from infectious disease models have constituted a significant part of the scientific evidence used to inform the response to the COVID-19 pandemic in the UK. These estimates can vary strikingly in their bias and variability. Epidemiological forecasts should be consistent with the observations that eventually materialize. We use simple scoring rules to refine the forecasts of a novel statistical model for multisource COVID-19 surveillance data by tuning its smoothness hyperparameter. This article is part of the theme issue ‘Technical challenges of modelling real-life epidemics and examples of overcoming these’.


Introduction
Several epidemiological modelling groups use statistical models of infectious disease to generate forecasts that contribute to a body of scientific evidence that informs the response to the COVID-19 pandemic in the UK. The models developed by the University of Cambridge MRC Biostatistics Unit and Public Health England (PHE) [1], the University of Warwick [2] and the London School of Hygiene & Tropical Medicine [3] provide three notable examples of statistical models used to produce such estimates.
Although Cramer et al. [4] and Funk et al. [5] consider the assessment of quantile-format forecasts for COVID-19, it does not appear to be standard practice to assess full-distribution epidemiological forecasts by comparing them to the observations that eventually materialize. Maishman et al. [6] provide a set of anonymized estimates for the effective reproduction number R t that highlights the striking differences in the bias and variability of estimates that different epidemiological models can produce. We contribute to the collective effort of modelling COVID-19 in this paper by introducing a statistical model for multisource COVID-19 surveillance data and by using simple scoring rules to refine its forecasts and improve its predictive performance. The statistical model is novel by its use of symptom report data from the NHS 111 telephone service, its compartments for convalescing and terminally ill individuals, and its implementation, which uses a bespoke numerical integrator to solve the system of ODEs for the transmission model.
We begin by describing the novel statistical model in §2 and then proceed to define a set of simple scoring rules in §3. In §4, we use the simple scoring rules to refine the forecasts of the statistical model. Finally, in §5, we bring the paper to an end by presenting our conclusion.

Statistical model
The model we describe in this section builds on a previous version whose implementation we contributed to the CoDatMo (Covid Data Models) organization on GitHub [7]. The purpose of CoDatMo is to provide a collection of COVID-19 models, all written in the statistical modelling language Stan [8]. In addition to models originally implemented in Stan, CoDatMo provides Stan instantiations of COVID-19 models, initially written in other programming languages. By hosting a set of well-documented models, all implemented in a common language, CoDatMo hopes to improve understanding of the set of existing models and make it easier for newcomers and established researchers within the domain of epidemiology to make extensions and potential improvements.
We note that CoDatMo has already had some success against its objectives, with a group, primarily based at Universidade Nove de Julho in Brazil, creating a related but simpler model [9]. We also note that the UK Health Security Agency uses a slightly more sophisticated version of the model presented in this paper to generate weekly estimates of the effective reproduction number and growth rate for regions of the UK [10].
The statistical model consists of two parts: a transmission model that captures a simplified mechanism for the spread of coronavirus through the population; and an observation model that encapsulates the assumptions about the connection between the states of the transmission model and the observed surveillance data used to calibrate the model.

(a) Transmission model
The simple SIR compartmental model developed by Kermack & McKendrick [11] provides the theoretical basis for the transmission model. As with the SIR model, we assume a single geographical region with a large population of identical individuals who come into contact with one another uniformly at random but do not come into contact with individuals from other areas. In contrast to some other models, for example, those developed by Birrell et al. [1] and Keeling et al. [2], we treat the population as identical in terms of age and sex and only discriminate between individuals based on their disease states. We also assume that the population is closed, meaning that no births or deaths occur and no migration in or out of the population occurs.
The SIR model divides the population into three disease states: individuals who are (S) susceptible to infection; individuals who have been infected with the disease and are (I) infectious to other people; and individuals who have (R) 'recovered' either by recuperating from the disease or dying. We augment these compartments in the transmission model by adding disease states for individuals who have been (E) exposed to the virus but are not yet infectious, individuals that are no longer infectious and whose final disease states are (P) pending, and individuals who have (D) died of the disease. The exposed and dead compartments are standard extensions to the original SIR model. By contrast, we believe the pending compartments to be a novel aspect of the model.  Figure 1. A graph of the transmission model. Individuals begin their journey in the susceptible (S) state. From here, they are infected and move into the exposed (E) state. After the virus has incubated for a while, they continue into the infectious (I) state. Next, they enter the pending (P) state, after which they either migrate into the recovered (R) state if convalescing or pass into the deceased (D) state if terminally ill. of infection. In addition to adding these compartments, we redefine the (R) recovered population to include the living only.
Inspired by the work of the University of Cambridge MRC Biostatistics Unit and PHE [1], we partition each of the intermediate disease states (E, I, P) into two sub-states. Partitioning the substates in this way makes the model more realistic by implicitly constraining the times spent in each of these disease states to have Erlang rather than exponential distributions.
We assume that there is at least one individual in each susceptible, exposed, and infectious compartment on 17 February 2020, which is the beginning of time in the model. At this stage, we assume that no population members are pending, have recovered, or have died. Two parameters, α 1 and α 2 , determine the allocation of the rest of the population to the first five compartments of the transmission model. α 1 is the proportion of the remaining population initially in the susceptible compartment, and α 2 is the proportion of the infected population that is not yet infectious at time zero. For simplicity, we divide the exposed and infectious populations equally between the respective sub-states. We can express the exact relationship between the two parameters, α 1 and α 2 , and the initial state of the transmission model as a set of equations We assume that the population randomly mixes as time elapses, with infectious and susceptible individuals coming into contact with one another, potentially transmitting the virus. Susceptible people who have become exposed through these contacts are not initially infectious. The virus replicates in their bodies for a time, known as the latent period, before they become royalsocietypublishing.org/journal/rsta Phil. Trans. R. Soc. A 380: infectious and have the potential to transmit the virus onto members of the remaining susceptible population. After being infectious for some time, we assume that individuals enter a state of pending before either recovering and becoming indefinitely immune to reinfection if they were convalescing or dying if they were terminally ill.
The number of individuals in each disease state varies with time according to a system of ordinary differential equations (ODEs) where -S(t) is the number of susceptible individuals who have not yet been infected and are at risk of infection, -E 1 (t) + E 2 (t) is the number of exposed individuals who have been infected but are not yet infectious, -I 1 (t) + I 2 (t) is the number of infectious individuals, -P 1 (t) + P 2 (t) is the number of pending individuals who are either convalescing or are terminally ill, -R(t) is the number of recovered individuals, -D(t) is the number of dead individuals, is the constant total number of individuals in the population, -d L is the mean time between infection and onset of infectiousness, -d I is the mean time for which individuals are infectious, -d P is the mean time for which individuals are pending, -ω, the infection fatality ratio (IFR), is the proportion of infected individuals who will die, -β(t) is the mean rate of contacts between individuals per unit time that are sufficient to lead to transmission if one of the individuals is infectious and the other is susceptible. β(t) is a continuous piecewise linear function of time where the mean rate of effective contacts during the jth time interval, β j (t), is given by The effective contact rate parameters β 1 , β 2 , . . . , β J+1 in equation (2.20) are the values β(t) takes on a set of predefined dates t 0 , t 1 , . . . , t J . The first date is 17 February 2020, and each date that follows is 7 days after the last, with the second date being 24 March 2020, the first day after the prime minister announced the first national lockdown.

(b) Observation model
Epidemiological modelling groups use many types of surveillance data to calibrate statistical models of infectious diseases. The observation model captures the assumptions about the relationship between the states of the transmission model and the surveillance data that we use for calibration.
We have designed the observation model to be extensible. Here, the model only has components for death, hospital admission and symptom report data to keep things simple. Nonetheless, the observation model can be extended to assimilate additional types of surveillance data, such as case data, by appending extra components similar in structure to those for ingesting the hospital admission and symptom report data.

(i) Death data
On their official website for coronavirus data [12], the UK government publishes a daily time series of the number of deaths of individuals whose death certificate mentioned COVID-19 as one of the causes. We assume that the number of deaths on day t, according to this definition, d obs (t), has a negative binomial distribution parameterized by a mean d(t) and parameter φ deaths which affects overdispersion where we use the alternative parameterization of the negative binomial distribution as defined by the Stan Development Team [13] NegativeBinomial (2.23) In equation (2.22), d(t) is the difference between the population of the D state of the transmission model between days t − 1 and t:

(ii) Hospital admission data
The UK government also publishes a daily time series of the number of COVID-19 patients admitted to hospital on their official website for coronavirus data [12]. We assume that, like the number of deaths, the number of hospital admissions on day t, h obs (t), has a negative binomial distribution parameterized by h(t) and φ admissions : where where the ratio of hospital admissions to potential patients during the kth time interval, ρ admissions,k (t), is given by 27) and the indicator function for the kth time interval, The parameters ρ admissions,1 , ρ admissions,2 , . . . , ρ admissions,K+1 in equation (2.27) are the values ρ admissions (t) takes on a set of predefined dates t 0 , t 1 , . . . , t K . The first date is 24 March 2020, and each date that follows is 12 weeks after the last, with the second date being 16 June 2020.
(iii) Symptom report data Every weekday up to the previous calendar day, NHS Digital publishes a daily time series of the number of assessments completed through the NHS 111 telephone service where callers reported potential coronavirus (COVID-19) symptoms [14]. Leclerc et al. found a strong correlation between the volume of these symptom reports and the number of COVID-19 deaths reported 16 days later [15]. We assume that, like the other types of surveillance data, the number of assessment calls to NHS 111 on day t where callers reported potential COVID-19 symptoms, c obs (t), has a negative binomial distribution parameterized by c(t) and φ calls i.e. the mean number of assessment calls to NHS 111 on day t where callers reported potential COVID-19 symptoms equals the ratio of symptom reports to potential symptom reporters, ρ calls (t), multiplied by the sum of the number of new members of the infectious and pending states.
ρ calls (t) is a continuous piecewise linear function of time almost identical to ρ admissions (t), which is defined by equations (2.26) and (2.27). The only difference is that the parameters ρ calls,1 , ρ calls,2 , . . . , ρ calls,L+1 are associated with a different set of predefined dates t 0 , t 1 , . . . , t L . The first of these dates is 24 March 2020, and each date that follows is four weeks after the last, with the second date being 16 March 2020.

Scoring rules
Scoring rules produce real numbers, also called numerical scores, that summarize the quality of probabilistic forecasts. More concretely, consider a probabilistic forecast P of an uncertain future quantity X for which the observation x eventually materializes. In a scenario such as this, a scoring rule provides a numerical score s(P, x) that quantifies the statistical consistency between the predictive distribution P and the observation x. Table 1 shows the simple scoring rules that we use in this paper.
The logarithmic, quadratic, spherical, ranked probability, Dawid-Sebastiani and squared error scores in table 1 are negatively oriented, with better forecasts typically resulting in lower scores. These scores are also said to be proper in the sense that a forecaster minimizes them when quoting their true belief. Proper scoring rules are considered essential for incentivising honest forecasting, a position argued by Gneiting & Raftery [20].  Table 1. The set of simple scoring rules that feature in this paper. In the definitions, p x is the probability mass of the predictive distribution for an observed count x, ||p|| 2 = ∞ k=0 p 2 k , P k is the value of cumulative predictive distribution for a count k, 1(.) is the indicator function, and μ P and σ 2 P are the mean and variance of the predictive distribution.
scoring rule definition reference logarithmic score logs(P, x) = − log p x Good [16] .  By contrast, the normalized squared error score is improper, a quality that has resulted in it being discredited, for example, by Czado et al. [18], as a tool for evaluating probabilistic forecasts. We argue that viewing it through the lens of propriety leads to an underappreciation of its unique properties, particularly its ability to distinguish between over-confidence and overcaution. Indeed, we see the normalized squared error score as a valuable diagnostic tool with advantages over proper scoring rules in certain situations.
Interestingly, the normalized squared error score is popular in the tracking and data fusion community, which has studied performance measures for evaluating estimation algorithms, such as Li & Zhao's absolute [22] and relative [23] error measures. Blasch et al. [24] and Chen et al. [25] describe a now-popular relative error measure called the normalized estimation error squared (NEES), which is a generalization of the normalized squared error score to multiple dimensions. Researchers in the community have used the NEES extensively as an easily understood approach to arguing the merits of, for example, different extensions of the Kalman filter to specific nonlinear settings where the extended Kalman filter is routinely over-confident [26].
Scores are typically reported as averages over multiple probabilistic forecasts, each for a distinct point in time. Following Czado et al. [18], we use uppercase to denote the mean score over several forecasts. The tables in this paper use the mean scores LogS, QS, SphS, RPS, DSS, SES and NSES.

Computational experiments (a) Set-up
We implement the statistical model described in §2 in Stan [8]. Stan is a probabilistic programming language that allows users to articulate statistical models and calibrate them with data using a Markov chain Monte Carlo method called the No-U-Turn sampler (NUTS), proposed by Hoffman & Gelman [27]. Stan also provides diagnostic information, such as warnings about divergent transitions, to help users check it has sampled the posterior faithfully.
In addition to encoding the statistical model, the Stan implementation includes prior distributions for the model parameters. Table 2        The software implementation, which is publicly available on GitHub, 1 has two idiosyncrasies worthy of discussion. First, the current implementation does not use any of the integrators provided by Stan to solve the system of ODEs in equation (2.10). Instead, a bespoke implementation of the explicit trapezoidal method [33] solves the system of ODEs for the transmission model. Anecdotally, the trapezoidal integrator significantly reduces runtime while producing acceptable numerical errors. Second, the current implementation only uses Stan's default initialization strategy for 1/φ deaths , 1/φ admissions , 1/φ calls , ρ admissions,1 , . . . , ρ admissions,K+1 , ρ calls,1 , . . ., and ρ calls,L+1 by drawing values uniformly between −2 and 2 on the unconstrained parameter space. Rather than doing this for α 1 , α 2 , β 1 , . . . , β J , d L , d I , d P and ω, the implementation draws uniformly from custom intervals to prevent initialization failures caused by unrealistic parameter values.

(b) Results
Mean scores for the forecasts that we generated with posterior samples for the parameters and point estimates for the parameters are presented in the top and bottom of table 3, respectively. The columns contain results for different values of the smoothness hyperparameter σ β defined in table 2. Smaller values of σ β make the random-walk prior on the effective contact rate β(t) tighter, causing it to vary more slowly and, if low enough, to underfit the data. Conversely, larger values of the smoothness hyperparameter loosen the random-walk prior and allow overfitting of the data if σ β is high enough. Generally, the mean scores in table 3 are lower, or closer to one in the case of NSES, for the forecasts generated with posterior samples for the parameters than for those generated with point estimates. This observation highlights the importance of allowing uncertainty propagation from statistical inference to forecasting. The differences between the values produced by the two forecasting methods are small for the lowest smoothness hyperparameter value of 0.0005 but increase with σ β until they become large and pronounced.
We have emboldened the best value for each of the mean scores, corresponding to the lowest value for the mean proper scores, LogS, QS, SphS, RPS, DSS and SES, and the value closest to and less than one for NSES. We select the best NSES value in this way because values of less than one correspond to over-cautious forecasts, which are arguably less damaging in terms of their impact on decision making than over-confident forecasts, which have NSES values of greater than one. The majority of the best mean scores are for the posterior samples for the parameters and a σ β value of 0.025. We can see that this forecast, shown in figure 2a, is over-cautious, which the NSES value correctly summarizes. We can also see the forecast for the point estimate for the parameters in  region of high probability mass, as it does for the Hybrid Rosenbrock distribution [34]. Although most of the best mean scores are for a σ β value of 0.025, the best values for the QS and SphS are for a smoothness hyperparameter value of 0.05. The differences, however, between the mean scores, 0.001 for QS and 0.006 for SphS, are relatively small and indicate little difference between the quality of the two forecasts when assessed with QS and SphS.
The forecast generated with the posterior samples for the parameters and a σ β value of 0.005 has an NSES value of 2.399, which indicates that it is over-confident. We can see in figure 2b that the forecast fails to predict most of the future observations and is indeed over-confident. The other scoring rules do not provide any information about the over-confidence of this forecast. We, therefore, believe that NSES's ability to distinguish between over-confidence and over-caution, given a single forecast, makes it a valuable diagnostic tool that should be used alongside proper scoring rules.

Conclusion
We have shown how to use simple scoring rules to develop a statistical model and improve its forecasting performance. The computational experiments presented in §4 demonstrate that the statistical model introduced in §2 provides the best forecasts when we use posterior samples for the parameters and a smoothness hyperparameter σ β value of 0.025. We, therefore, advocate simple scoring rules for evaluating epidemiological forecasts and NSES specifically to establish if they are over-confident or over-cautious.
One of the significant limitations of simple scoring rules is that we can only use them to assess forecasts of observable variables. Epidemiological modellers cannot apply them to important latent quantities, such as the effective reproduction number R t and the growth rate r, which they often forecast. Accordingly, the epidemiological community needs a method for assessing forecasts of quantities for which the truth is unknown. Simulation-based calibration (SBC) [35] is a candidate method for this task, worth investigating further.
There are four worthwhile directions in which we can extend the statistical model presented in §2. The first direction involves adding components to the observation model described in §2b to allow calibration with a greater quantity and diversity of surveillance data. The second direction involves modifying it to accommodate surveillance data from other countries. The third direction entails making the disease-specific, geography-independent parameters d L , d I , d T and ω global to facilitate information sharing between regions, as is done by Birrell et al. [1]. The fourth and final direction involves removing the assumption that recovered individuals are indefinitely immune to reinfection, allowing reinfection, which is more realistic.
Data accessibility. The code and data for the computational experiments are publicly available on GitHub (https://github.com/codatmo/UniversityOfLiverpool_PaperSubmission).